In this project I plan to work with a dataset called “Students Mental Health Assessments.” This dataset contains a range of information about students, including their age, course of study, gender, stress levels, depression scores, anxiety scores, and more. It is a comprehensive dataset that can provide insights into the mental health of students. The dataset was collected from Kaggle. Later on, to get a better analysis, I chose another dataset as well; that information appears later in this report.
## Reading Data (Importing Data)
# Load the tidyverse (readr for read_csv(); dplyr and ggplot2 are used below)
library(tidyverse)

# Assuming your file is named "students_mental_health_survey.csv"
file_path <- "students_mental_health_survey.csv"
# Read the CSV file without printing the column-type summary
Student_Data <- read_csv(file_path, show_col_types = FALSE)
# Print information about the data
print(Student_Data)
## # A tibble: 7,022 × 20
## Age Course Gender CGPA Stress_Level Depression_Score Anxiety_Score
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 25 Others Male 3.56 3 3 2
## 2 24 Engineering Female 2.44 0 3 0
## 3 19 Business Female 3.74 4 0 3
## 4 19 Computer Scie… Male 3.65 2 1 0
## 5 18 Business Male 3.4 3 3 4
## 6 21 Medical Female 3.35 2 4 3
## 7 18 Law Male 3.65 2 2 5
## 8 21 Business Female 3.4 0 3 3
## 9 24 Medical Male 3.8 3 2 1
## 10 19 Engineering Female 3.05 2 5 0
## # ℹ 7,012 more rows
## # ℹ 13 more variables: Sleep_Quality <chr>, Physical_Activity <chr>,
## # Diet_Quality <chr>, Social_Support <chr>, Relationship_Status <chr>,
## # Substance_Use <chr>, Counseling_Service_Use <chr>, Family_History <chr>,
## # Chronic_Illness <chr>, Financial_Stress <dbl>,
## # Extracurricular_Involvement <chr>, Semester_Credit_Load <dbl>,
## # Residence_Type <chr>
# Print the number of rows and columns
cat("Number of rows:", nrow(Student_Data), "\n")
## Number of rows: 7022
cat("Number of columns:", ncol(Student_Data), "\n")
## Number of columns: 20
# Replace empty strings with NA
Student_Data[Student_Data == ""] <- NA
# Remove rows with missing values
Student_Data_without_missing_data <- na.omit(Student_Data)
# Check the dimensions of the cleaned data
dim(Student_Data_without_missing_data)
## [1] 6995 20
# Print the data after dropping rows with missing values
Student_Data_without_missing_data
## # A tibble: 6,995 × 20
## Age Course Gender CGPA Stress_Level Depression_Score Anxiety_Score
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 25 Others Male 3.56 3 3 2
## 2 24 Engineering Female 2.44 0 3 0
## 3 19 Business Female 3.74 4 0 3
## 4 18 Business Male 3.4 3 3 4
## 5 21 Medical Female 3.35 2 4 3
## 6 18 Law Male 3.65 2 2 5
## 7 21 Business Female 3.4 0 3 3
## 8 24 Medical Male 3.8 3 2 1
## 9 22 Computer Scie… Male 3.19 1 1 3
## 10 27 Medical Male 3.26 3 2 2
## # ℹ 6,985 more rows
## # ℹ 13 more variables: Sleep_Quality <chr>, Physical_Activity <chr>,
## # Diet_Quality <chr>, Social_Support <chr>, Relationship_Status <chr>,
## # Substance_Use <chr>, Counseling_Service_Use <chr>, Family_History <chr>,
## # Chronic_Illness <chr>, Financial_Stress <dbl>,
## # Extracurricular_Involvement <chr>, Semester_Credit_Load <dbl>,
## # Residence_Type <chr>
cat("Number of rows:", nrow(Student_Data_without_missing_data ), "\n")
## Number of rows: 6995
cat("Number of columns:", ncol(Student_Data_without_missing_data ), "\n")
## Number of columns: 20
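Dropping rows with missing values removed 27 of them (7,022 − 6,995). Before doing so, it can be useful to see which columns the missing values come from; a quick sketch using the objects defined above:

```r
# Count missing values per column (run after the empty-string replacement,
# before na.omit); columns with non-zero counts drive the dropped rows
colSums(is.na(Student_Data))

# The same check in tidyverse style
Student_Data %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
```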
I want to know which factors affect students’ stress levels, so I plotted stress level against each of the remaining 19 features. For easy visualization, I chose scatter plots for the continuous variables and box plots for the categorical variables.
# Scatter Plot of Stress Level and CGPA
scatter_plot_cgpa <- ggplot(Student_Data_without_missing_data, aes(x = CGPA, y = Stress_Level,colour=Course)) +
geom_point() +
labs(title = "Scatter Plot of Stress Level and CGPA", x = "CGPA", y = "Stress_Level")+
theme(plot.title = element_text(hjust = 0.5)) # Center the title
polar_scatter_plot <- ggplot(Student_Data_without_missing_data, aes(x = CGPA, y = Stress_Level, color = CGPA)) +
geom_point() +
labs(title = "Polar Scatter Plot of Stress Level by CGPA", x = "CGPA", y = "Stress_Level") +
theme(plot.title = element_text(hjust = 0.5)) + # Center the title
coord_polar() +
scale_color_viridis_c()
# Display plots
print(scatter_plot_cgpa)
print(polar_scatter_plot)
The plots depict the relationship between stress level and CGPA, with
the CGPA ranging from 2.5 to 4.0 and stress levels on a discrete scale.
The plot shows that individuals with similar CGPAs can have varying
stress levels, indicating no clear trend between CGPA and stress. Data
points are densely populated between CGPAs of 3.0 to 3.5. The absence of
a visible pattern suggests that other factors may influence the
relationship between stress and academic performance.
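The visual impression of “no clear trend” can be checked numerically. A minimal sketch using base R’s cor(); a value near 0 would support the reading above:

```r
# Pearson correlation between CGPA and stress level
cor(Student_Data_without_missing_data$CGPA,
    Student_Data_without_missing_data$Stress_Level)
```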
# Scatter Plot of Stress Level and Age
scatter_plot_age <- ggplot(Student_Data_without_missing_data, aes(x = Age, y = Stress_Level)) +
geom_point() +
labs(title = "Scatter Plot of Stress Level and Age", x = "Age", y = "Stress_Level")+
theme(plot.title = element_text(hjust = 0.5)) # Center the title
# Display plot
print(scatter_plot_age)
The scatter plot shows the relationship between stress level and age. Stress level is plotted on the vertical axis, ranging from 0 to 5, and age is on the horizontal axis, ranging from about 20 to 35 years. The plot presents a uniform distribution of stress levels across different ages with no apparent pattern, indicating that within this dataset, age does not seem to significantly affect reported stress levels. Each horizontal line of dots represents all the reported stress levels for a particular age, and every age has a full range of stress levels.
# Scatter Plot of Stress Level and Anxiety Score
scatter_plot_Anxiety_Score <- ggplot(Student_Data_without_missing_data, aes(x = Anxiety_Score, y = Stress_Level)) +
geom_point() +
labs(title = "Scatter Plot of Stress Level and Anxiety Score", x = "Anxiety_Score", y = "Stress_Level")+
theme(plot.title = element_text(hjust = 0.5)) # Center the title
# Display plot
print(scatter_plot_Anxiety_Score)
The scatter plot illustrates the relationship between stress level
(vertical axis) and anxiety score (horizontal axis), both seemingly
rated on a scale from 0 to 5. Similar to the previous plots, there’s a
uniform distribution with each level of anxiety having a full range of
stress levels associated with it. This uniformity indicates no clear
correlation between the two variables in the data presented. Each point
likely represents an individual’s reported stress level at a given
anxiety score, and the plot suggests that an individual’s stress level
can vary widely regardless of their anxiety score.
# Scatter Plot of Stress Level and Depression Score
scatter_plot_Depression_Score <- ggplot(Student_Data_without_missing_data, aes(x = Depression_Score, y = Stress_Level)) +
geom_point() +
labs(title = "Scatter Plot of Stress Level and Depression Score", x = "Depression_Score", y = "Stress_Level")+
theme(plot.title = element_text(hjust = 0.5)) # Center the title
# Display plot
print(scatter_plot_Depression_Score)
The scatter plot shows the relationship between stress level and
depression score, with both variables appearing to be rated on a scale
from 0 to 5. The plot displays a uniform distribution of stress levels
across the range of depression scores, indicating no clear pattern or
correlation within this dataset. Each depression score has associated
with it a full span of stress levels, suggesting that stress levels are
independent of depression scores in this sample. Each point likely
represents an individual’s stress level for their respective depression
score.
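The same numeric check can cover all of the continuous predictors examined so far in one call (a sketch using dplyr’s across()):

```r
# Correlation of Stress_Level with each numeric predictor plotted above;
# values near 0 match the "uniform distribution" seen in the scatter plots
Student_Data_without_missing_data %>%
  summarise(across(c(Age, CGPA, Anxiety_Score, Depression_Score),
                   ~ cor(.x, Stress_Level)))
```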
# Boxplot of Stress Level by Gender
boxplot_gender <- ggplot(Student_Data_without_missing_data, aes(x = Gender, y = Stress_Level)) +
geom_boxplot() +
labs(title = "Boxplot of Stress Level by Gender", x = "Gender", y = "Stress_Level")+
theme(plot.title = element_text(hjust = 0.5)) # Center the title
# Display plot
print(boxplot_gender)
The boxplot compares stress levels between two gender groups: female and
male. The stress level, likely on a scale from 0 to 5, is represented on
the vertical axis. Both boxplots have a median stress level around 2,
indicated by the line within the box. The range of stress levels
(interquartile range) for both genders is similar, spanning from about 2
to 4, with the ‘whiskers’ suggesting that there are individuals who
report stress levels slightly outside this range. There are no visible
outliers, and the distribution seems quite symmetrical for both genders,
suggesting a similar variation in stress levels among females and males
within this dataset.
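The medians and interquartile ranges read off the boxplot can be computed directly; a sketch with dplyr:

```r
# Median and IQR of stress level by gender, backing up the boxplot reading
Student_Data_without_missing_data %>%
  group_by(Gender) %>%
  summarise(Median_Stress = median(Stress_Level),
            IQR_Stress = IQR(Stress_Level))
```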
# Boxplot of Stress Level by Course
boxplot_course <- ggplot(Student_Data_without_missing_data, aes(x = Course, y = Stress_Level, fill = Course)) +
geom_boxplot() +
labs(title = "Plot of Stress Level by Course", x = "Course", y = "Stress_Level") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.title = element_text(hjust = 0.5)) # Center the title
# Display the plot
print(boxplot_course)
boxplot_course + coord_polar()
The plot displays stress levels by academic course, with medians around
2 for all fields. Variability in stress is similar for Business,
Computer Science, Law, Engineering and Others. The Medical field has a
marginally higher median stress level, but overall, the stress level
distribution is consistent across courses without any outliers.
# Boxplot of Stress Level by Sleep Quality
boxplot_sleep_quality <- ggplot(Student_Data_without_missing_data, aes(x = Sleep_Quality, y = Stress_Level)) +
geom_boxplot() +
labs(title = "Boxplot of Stress Level by Sleep Quality", x = "Sleep_Quality", y = "Stress_Level")+
theme(plot.title = element_text(hjust = 0.5)) # Center the title
# Display plot
print(boxplot_sleep_quality)
The boxplot illustrates stress levels categorized by sleep quality,
which is divided into ‘Average’, ‘Good’, and ‘Poor’. The stress level is
consistent across all sleep quality categories with a median around 2.
The interquartile ranges are also similar, indicating that the spread of
stress levels is relatively uniform regardless of sleep quality. There
are no outliers or extreme values, suggesting that within this dataset,
perceived sleep quality does not show a distinct impact on stress
levels.
# Boxplot of Stress Level by Diet Quality
boxplot_Diet_Quality <- ggplot(Student_Data_without_missing_data, aes(x = Diet_Quality, y = Stress_Level)) +
geom_boxplot() +
labs(title = "Boxplot of Stress Level by Diet Quality", x = "Diet_Quality", y = "Stress_Level")+
theme(plot.title = element_text(hjust = 0.5)) # Center the title
# Display plot
print(boxplot_Diet_Quality)
The boxplot shows stress levels in relation to diet quality, categorized
as ‘Average’, ‘Good’, and ‘Poor’. Stress levels are on a scale on the
vertical axis, and the median stress level is consistent across all diet
categories, situated around the 3 mark. The interquartile ranges, which
represent the middle 50% of responses, are similar for all three diet
qualities, showing no significant variation in stress levels with diet
quality. No outliers are present, indicating a uniform distribution of
stress levels across the dietary spectrum in the dataset.
# Boxplot of Stress Level by Social_Support
boxplot_Social_Support <- ggplot(Student_Data_without_missing_data, aes(x = Social_Support, y = Stress_Level)) +
geom_boxplot() +
labs(title = "Boxplot of Stress Level by Social Support", x = "Social_Support", y = "Stress_Level")+
theme(plot.title = element_text(hjust = 0.5)) # Center the title
# Display plot
print(boxplot_Social_Support)
The boxplot displays stress levels in relation to social support,
categorized as ‘High’, ‘Low’, and ‘Moderate’. The median stress level
appears consistent across all categories, near the value of 2. The
interquartile ranges are similar, indicating a comparable spread of
stress levels regardless of social support level. There are no outliers
or extreme values shown, which suggests that within this dataset, the
perceived level of social support does not have a distinct impact on the
reported stress levels.
# Boxplot of Stress Level by Relationship_Status
boxplot_Relationship_Status <- ggplot(Student_Data_without_missing_data, aes(x = Relationship_Status, y = Stress_Level)) +
geom_boxplot() +
labs(title = "Boxplot of Stress Level by Relationship Status", x = "Relationship_Status", y = "Stress_Level")+
theme(plot.title = element_text(hjust = 0.5)) # Center the title
# Display plot
print(boxplot_Relationship_Status)
The boxplot compares stress levels by relationship status: ‘In a
Relationship’, ‘Married’, and ‘Single’. Stress levels are indicated on
the vertical axis and all three categories show a median stress level
close to 2. The interquartile ranges, representing the middle 50% of
data, are similar across the categories, suggesting a comparable spread
of stress levels regardless of relationship status. The ‘whiskers’ of
the plot indicate the range of the data within 1.5 times the
interquartile range, and they appear to be similar for all groups. There
are no outliers indicated, which suggests that there is no significant
variation in stress levels that can be attributed to relationship status
within this dataset.
# Boxplot of Stress Level by Physical_Activity
boxplot_Physical_Activity <- ggplot(Student_Data_without_missing_data, aes(x = Physical_Activity, y = Stress_Level)) +
geom_boxplot() +
labs(title = "Boxplot of Stress Level by Physical Activity", x = "Physical_Activity", y = "Stress_Level")+
theme(plot.title = element_text(hjust = 0.5)) # Center the title
# Display plot
print(boxplot_Physical_Activity)
The boxplot depicts stress levels based on physical activity,
categorized as ‘High’, ‘Low’, and ‘Moderate’. The median stress level
for each category is approximately at the 2 mark on the vertical axis.
The spread of stress levels, as indicated by the interquartile ranges,
is consistent across all levels of physical activity. There are no
outliers shown, suggesting a uniform distribution of stress levels
within each physical activity group in this dataset.
# Boxplot of Stress Level by Substance_Use
boxplot_Substance_Use <- ggplot(Student_Data_without_missing_data, aes(x = Substance_Use, y = Stress_Level)) +
geom_boxplot() +
labs(title = "Boxplot of Stress Level by Substance Use", x = "Substance_Use", y = "Stress_Level")+
theme(plot.title = element_text(hjust = 0.5)) # Center the title
# Display plot
print(boxplot_Substance_Use)
The boxplot illustrates stress levels in relation to substance use,
categorized as ‘Frequently’, ‘Never’, and ‘Occasionally’. The median
stress level for ‘Never’ and ‘Occasionally’ is about the same, while
‘Frequently’ is slightly higher. The range of stress levels, indicated
by the boxes, is broader for those who ‘Never’ use substances, with
‘Frequently’ showing the narrowest interquartile range. Notably, there
are outliers in the ‘Frequently’ category, indicating individuals with
extremely low stress levels, unlike the other two categories. This
suggests some variations in stress levels associated with substance use
frequency.
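Because this is one of the few plots showing group differences, it is worth quantifying; a sketch summarising each substance-use category:

```r
# Count, median, and IQR per substance-use category; useful for checking
# the "Frequently is slightly higher" reading of the boxplot
Student_Data_without_missing_data %>%
  group_by(Substance_Use) %>%
  summarise(n = n(),
            Median_Stress = median(Stress_Level),
            IQR_Stress = IQR(Stress_Level))
```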
# Boxplot of Stress Level by Counseling_Service_Use
boxplot_Counseling_Service_Use <- ggplot(Student_Data_without_missing_data, aes(x = Counseling_Service_Use, y = Stress_Level)) +
geom_boxplot() +
labs(title = "Boxplot of Stress Level by Counseling Service Use", x = "Counseling_Service_Use", y = "Stress_Level")+
theme(plot.title = element_text(hjust = 0.5)) # Center the title
# Display plot
print(boxplot_Counseling_Service_Use)
The boxplot depicts stress levels in relation to the use of counseling
services, categorized as ‘Frequently’, ‘Never’, and ‘Occasionally’. The
median stress level is around 2 for those who ‘Never’ or ‘Occasionally’
use counseling services, and slightly higher for those who ‘Frequently’
use them. The interquartile ranges are comparable across all categories,
suggesting a similar spread of stress levels irrespective of counseling
service usage. The ‘whiskers’ extend to show the full range of reported
stress levels, with no significant outliers present, which indicates a
relatively consistent distribution of stress levels across the frequency
of counseling service use within this dataset.
# Boxplot of Stress Level by Family_History
boxplot_Family_History <- ggplot(Student_Data_without_missing_data, aes(x = Family_History, y = Stress_Level)) +
geom_boxplot() +
labs(title = "Boxplot of Stress Level by Family History", x = "Family_History", y = "Stress_Level")+
theme(plot.title = element_text(hjust = 0.5)) # Center the title
# Display plot
print(boxplot_Family_History)
The boxplot shows stress levels categorized by family history of mental
health issues, with ‘No’ on the left and ‘Yes’ on the right. The median
stress level for those with no family history is around 2.5, and for
those with a family history, it is slightly higher, near 3. The
interquartile range is slightly wider for those with a family history,
indicating a greater spread in stress levels. Both categories have a
similar overall range of stress levels, as indicated by the ‘whiskers’.
There are no significant outliers in either category, suggesting that
while there may be a slight difference in median stress levels, the
overall distribution is fairly consistent regardless of family
history.
# Boxplot of Stress Level by Chronic_Illness
boxplot_Chronic_Illness <- ggplot(Student_Data_without_missing_data, aes(x = Chronic_Illness, y = Stress_Level)) +
geom_boxplot() +
labs(title = "Boxplot of Stress Level by Chronic Illness", x = "Chronic_Illness", y = "Stress_Level")+
theme(plot.title = element_text(hjust = 0.5)) # Center the title
# Display plot
print(boxplot_Chronic_Illness)
The boxplot compares stress levels based on the presence of a chronic
illness, with categories ‘No’ and ‘Yes’. Those without a chronic illness
have a median stress level slightly above 2, while those with a chronic
illness have a median stress level near 3. The interquartile ranges,
indicating the middle 50% of data, are similar for both groups,
suggesting a comparable variability in stress levels. The ‘whiskers’
show the full range of stress levels within each category, with no
significant outliers. This indicates that while the median stress level
is slightly higher for those with a chronic illness, the overall
distribution of stress levels is relatively consistent between the two
groups.
# Scatter Plot of Stress Level and Financial_Stress
scatter_Financial_Stress <- ggplot(Student_Data_without_missing_data, aes(x = Financial_Stress, y = Stress_Level)) +
geom_point() +
labs(title = "Scatter Plot of Stress Level and Financial Stress", x = "Financial_Stress", y = "Stress_Level")+
theme(plot.title = element_text(hjust = 0.5)) # Center the title
# Display plot
print(scatter_Financial_Stress)
The scatter plot illustrates the relationship between stress level and
financial stress, with financial stress on the horizontal axis and
stress level on the vertical axis, both seemingly rated on a scale from
0 to 5. The plot shows that for each level of financial stress, there is
a full range of reported stress levels, suggesting that within this
dataset, there is no clear trend indicating that financial stress
directly correlates with overall stress level. Individuals report a wide
variety of stress levels at each point of financial stress, indicating
variability in the impact of financial stress on overall stress
levels.
# Boxplot of Stress Level by Extracurricular_Involvement
boxplot_Extracurricular_Involvement <- ggplot(Student_Data_without_missing_data, aes(x = Extracurricular_Involvement, y = Stress_Level)) +
geom_boxplot() +
labs(title = "Boxplot of Stress Level by Extracurricular Involvement", x = "Extracurricular_Involvement", y = "Stress_Level")+
theme(plot.title = element_text(hjust = 0.5)) # Center the title
# Display plots
print(boxplot_Extracurricular_Involvement)
The boxplot compares stress levels based on the extent of extracurricular involvement, with categories ‘High’, ‘Low’, and ‘Moderate’. The ‘Low’ and ‘Moderate’ categories have similar median stress levels, around the 2 mark, while the ‘High’ category has a median around 3. The ‘whiskers’ of the plot extend to the full range of reported stress levels within each category, and no outliers are present. This indicates that within this dataset, the level of extracurricular involvement does appear to have a distinct impact on reported stress levels.
# Scatter Plot of Stress Level and Semester_Credit_Load
scatter_Semester_Credit_Load <- ggplot(Student_Data_without_missing_data, aes(x = Semester_Credit_Load, y = Stress_Level)) +
geom_point() +
labs(title = "Scatter Plot of Stress level and Semester Credit Load", x = "Semester_Credit_Load", y = "Stress_Level")+
theme(plot.title = element_text(hjust = 0.5)) # Center the title
# Display plot
print(scatter_Semester_Credit_Load)
The scatter plot displays the relationship between stress level and
semester credit load. Stress level is on the vertical axis, and semester
credit load is on the horizontal axis, starting at approximately 15
credits. Each dot represents an individual’s reported stress level at a
particular credit load. The plot shows a uniform distribution across
different credit loads, with each level of credit load having a full
range of stress levels from 0 to 5. This indicates no clear correlation
between the number of credits a student is taking and their reported
stress level. Students with low, moderate, and high credit loads report
a similar range of stress levels, suggesting that factors other than
credit load alone may influence stress levels.
# Boxplot of Stress Level by Residence_Type
boxplot_Residence_Type <- ggplot(Student_Data_without_missing_data, aes(x = Residence_Type, y = Stress_Level)) +
geom_boxplot() +
labs(title = "Boxplot of Stress Level by Residence Type", x = "Residence_Type", y = "Stress_Level")+
theme(plot.title = element_text(hjust = 0.5)) # Center the title
# Display plots
print(boxplot_Residence_Type)
The boxplot illustrates stress levels among individuals based on their type of residence: ‘Off-Campus’, ‘On-Campus’, and ‘With Family’. The median stress level for the off-campus and on-campus residence types is similar, hovering around 2, while the ‘With Family’ category has a median near 3. The ‘Off-Campus’ and ‘On-Campus’ categories have comparable interquartile ranges. The whiskers, which represent the range of the data, extend similarly across all types, and no significant outliers are present. This indicates that residence type does have a noticeable impact on stress levels within this dataset.
From these results I observed that the following factors affect stress levels in students: Course, Substance Use, Counseling Service Use, Family History, Chronic Illness, Extracurricular Involvement, and Residence Type (based on the dataset I am using).
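These visual impressions can also be tested formally. A sketch using the Kruskal–Wallis rank-sum test (a reasonable choice here since Stress_Level is an ordinal 0–5 scale) for one candidate factor:

```r
# Does the stress-level distribution differ by residence type?
# A small p-value would indicate a statistically significant difference.
kruskal.test(Stress_Level ~ Residence_Type,
             data = Student_Data_without_missing_data)
```

The same call can be repeated with Course, Substance_Use, Counseling_Service_Use, Family_History, Chronic_Illness, or Extracurricular_Involvement on the right-hand side of the formula.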
## Data Transformation
Using filter() to subset the data so that it contains only the observations with the characteristics that appear to affect stress levels.
filtered_data <- Student_Data_without_missing_data %>%
filter(
Substance_Use=="Frequently" |
Counseling_Service_Use=="Frequently" |
Family_History=="Yes" |
Chronic_Illness=="Yes" |
Extracurricular_Involvement=="High" |
Residence_Type=="With Family"
)
filtered_data
## # A tibble: 4,498 × 20
## Age Course Gender CGPA Stress_Level Depression_Score Anxiety_Score
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 19 Business Female 3.74 4 0 3
## 2 18 Business Male 3.4 3 3 4
## 3 21 Business Female 3.4 0 3 3
## 4 24 Medical Male 3.8 3 2 1
## 5 22 Computer Scie… Male 3.19 1 1 3
## 6 27 Medical Male 3.26 3 2 2
## 7 25 Law Male 3.61 3 1 5
## 8 18 Medical Female 3.85 4 1 3
## 9 19 Medical Male 3.26 5 1 1
## 10 22 Computer Scie… Male 3.46 3 1 0
## # ℹ 4,488 more rows
## # ℹ 13 more variables: Sleep_Quality <chr>, Physical_Activity <chr>,
## # Diet_Quality <chr>, Social_Support <chr>, Relationship_Status <chr>,
## # Substance_Use <chr>, Counseling_Service_Use <chr>, Family_History <chr>,
## # Chronic_Illness <chr>, Financial_Stress <dbl>,
## # Extracurricular_Involvement <chr>, Semester_Credit_Load <dbl>,
## # Residence_Type <chr>
I wanted to keep only the rows that have all of the specific features, but when I used “&” the result had 0 rows. So I went ahead and used the “|” operator instead, which keeps every row that has at least one of the features I want to examine. filtered_data is the new dataset containing the filtered rows.
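The difference between “&” and “|” is easiest to see by counting rows; a sketch using just two of the conditions:

```r
# "&" keeps only rows matching every condition (intersection)
Student_Data_without_missing_data %>%
  filter(Substance_Use == "Frequently" & Family_History == "Yes") %>%
  nrow()

# "|" keeps rows matching at least one condition (union), so it is larger
Student_Data_without_missing_data %>%
  filter(Substance_Use == "Frequently" | Family_History == "Yes") %>%
  nrow()
```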
# Arrange the data based on CGPA in descending order
arranged_data <- arrange(filtered_data, desc(CGPA))
arranged_data
## # A tibble: 4,498 × 20
## Age Course Gender CGPA Stress_Level Depression_Score Anxiety_Score
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 19 Law Male 4 5 0 1
## 2 24 Business Female 4 2 1 0
## 3 28 Law Female 4 0 1 4
## 4 25 Computer Scie… Female 4 0 3 3
## 5 25 Law Male 4 1 4 0
## 6 27 Engineering Female 4 4 4 2
## 7 21 Computer Scie… Male 4 1 1 2
## 8 20 Law Male 4 2 2 1
## 9 18 Engineering Male 4 3 0 3
## 10 20 Law Male 4 4 2 3
## # ℹ 4,488 more rows
## # ℹ 13 more variables: Sleep_Quality <chr>, Physical_Activity <chr>,
## # Diet_Quality <chr>, Social_Support <chr>, Relationship_Status <chr>,
## # Substance_Use <chr>, Counseling_Service_Use <chr>, Family_History <chr>,
## # Chronic_Illness <chr>, Financial_Stress <dbl>,
## # Extracurricular_Involvement <chr>, Semester_Credit_Load <dbl>,
## # Residence_Type <chr>
arranged_data is the data with the CGPA arranged in descending order. At first it looks as though the CGPA values have changed: in filtered_data the CGPA varies from 4.00 to 2.60, but in arranged_data it appears to vary only from 4.00 to 3.73. This is because a tibble prints only its first 10 rows; after sorting in descending order the visible rows all come from the top of the CGPA range, while the underlying values are unchanged.
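One way to confirm that arrange() only reorders rows is to compare the CGPA range before and after sorting; both calls should return identical values:

```r
# arrange() does not alter any values, only the row order
range(filtered_data$CGPA)
range(arranged_data$CGPA)
```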
arranged_data_continuous <- arrange(filtered_data, Course)
arranged_data_continuous
## # A tibble: 4,498 × 20
## Age Course Gender CGPA Stress_Level Depression_Score Anxiety_Score
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 19 Business Female 3.74 4 0 3
## 2 18 Business Male 3.4 3 3 4
## 3 21 Business Female 3.4 0 3 3
## 4 23 Business Female 3.7 0 1 2
## 5 24 Business Female 4 2 1 0
## 6 23 Business Male 3.75 3 2 2
## 7 18 Business Female 3.71 0 1 2
## 8 28 Business Female 3.12 4 1 2
## 9 24 Business Male 3.64 2 3 1
## 10 29 Business Female 3.12 1 5 2
## # ℹ 4,488 more rows
## # ℹ 13 more variables: Sleep_Quality <chr>, Physical_Activity <chr>,
## # Diet_Quality <chr>, Social_Support <chr>, Relationship_Status <chr>,
## # Substance_Use <chr>, Counseling_Service_Use <chr>, Family_History <chr>,
## # Chronic_Illness <chr>, Financial_Stress <dbl>,
## # Extracurricular_Involvement <chr>, Semester_Credit_Load <dbl>,
## # Residence_Type <chr>
I also checked arrange() with the categorical variable “Course”; here the printout shows only Business courses. I was not sure what was wrong at first, and even following Dr. Swan’s approach I could not get the expected format. As with the CGPA case, this is just the tibble printing its first 10 rows: sorting alphabetically puts Business first, and the other courses appear further down.
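A quick check confirms that every course category is still present after sorting; only the printed order changed:

```r
# All distinct course values, and how many rows fall in each
unique(arranged_data_continuous$Course)
table(arranged_data_continuous$Course)
```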
# Arranging columns with select()
select() pulls out only the named columns. Some datasets come with a huge number of variables, and it can be helpful to focus only on those of immediate interest:
select(filtered_data, "Course",
"Stress_Level",
"Substance_Use",
"Counseling_Service_Use",
"Family_History",
"Chronic_Illness",
"Extracurricular_Involvement",
"Residence_Type"
)
## # A tibble: 4,498 × 8
## Course Stress_Level Substance_Use Counseling_Service_Use Family_History
## <chr> <dbl> <chr> <chr> <chr>
## 1 Business 4 Never Occasionally No
## 2 Business 3 Never Never No
## 3 Business 0 Never Never Yes
## 4 Medical 3 Frequently Never Yes
## 5 Computer Sc… 1 Never Occasionally No
## 6 Medical 3 Never Occasionally No
## 7 Law 3 Never Never Yes
## 8 Medical 4 Never Never Yes
## 9 Medical 5 Never Never Yes
## 10 Computer Sc… 3 Never Frequently No
## # ℹ 4,488 more rows
## # ℹ 3 more variables: Chronic_Illness <chr>, Extracurricular_Involvement <chr>,
## # Residence_Type <chr>
Renaming the variables using rename()
rename(filtered_data, Stress_Points = Stress_Level, Credit_Load = Semester_Credit_Load, Counseling_Service = Counseling_Service_Use)
## # A tibble: 4,498 × 20
## Age Course Gender CGPA Stress_Points Depression_Score Anxiety_Score
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 19 Business Female 3.74 4 0 3
## 2 18 Business Male 3.4 3 3 4
## 3 21 Business Female 3.4 0 3 3
## 4 24 Medical Male 3.8 3 2 1
## 5 22 Computer Sci… Male 3.19 1 1 3
## 6 27 Medical Male 3.26 3 2 2
## 7 25 Law Male 3.61 3 1 5
## 8 18 Medical Female 3.85 4 1 3
## 9 19 Medical Male 3.26 5 1 1
## 10 22 Computer Sci… Male 3.46 3 1 0
## # ℹ 4,488 more rows
## # ℹ 13 more variables: Sleep_Quality <chr>, Physical_Activity <chr>,
## # Diet_Quality <chr>, Social_Support <chr>, Relationship_Status <chr>,
## # Substance_Use <chr>, Counseling_Service <chr>, Family_History <chr>,
## # Chronic_Illness <chr>, Financial_Stress <dbl>,
## # Extracurricular_Involvement <chr>, Credit_Load <dbl>, Residence_Type <chr>
# Adding new variables with mutate()
mutated_data <- filtered_data %>% mutate(Total_Score = Depression_Score + Anxiety_Score)
mutated_data
## # A tibble: 4,498 × 21
## Age Course Gender CGPA Stress_Level Depression_Score Anxiety_Score
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 19 Business Female 3.74 4 0 3
## 2 18 Business Male 3.4 3 3 4
## 3 21 Business Female 3.4 0 3 3
## 4 24 Medical Male 3.8 3 2 1
## 5 22 Computer Scie… Male 3.19 1 1 3
## 6 27 Medical Male 3.26 3 2 2
## 7 25 Law Male 3.61 3 1 5
## 8 18 Medical Female 3.85 4 1 3
## 9 19 Medical Male 3.26 5 1 1
## 10 22 Computer Scie… Male 3.46 3 1 0
## # ℹ 4,488 more rows
## # ℹ 14 more variables: Sleep_Quality <chr>, Physical_Activity <chr>,
## # Diet_Quality <chr>, Social_Support <chr>, Relationship_Status <chr>,
## # Substance_Use <chr>, Counseling_Service_Use <chr>, Family_History <chr>,
## # Chronic_Illness <chr>, Financial_Stress <dbl>,
## # Extracurricular_Involvement <chr>, Semester_Credit_Load <dbl>,
## # Residence_Type <chr>, Total_Score <dbl>
After mutating, the two score columns are combined into one (Total_Score), so I am going to drop those original columns:
# Drop multiple columns
After_Deleting_Columns<- select(mutated_data, -Depression_Score,-Anxiety_Score)
After_Deleting_Columns
## # A tibble: 4,498 × 19
## Age Course Gender CGPA Stress_Level Sleep_Quality Physical_Activity
## <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 19 Business Female 3.74 4 Good Low
## 2 18 Business Male 3.4 3 Good Low
## 3 21 Business Female 3.4 0 Average Low
## 4 24 Medical Male 3.8 3 Poor Low
## 5 22 Computer Sci… Male 3.19 1 Average Moderate
## 6 27 Medical Male 3.26 3 Average Moderate
## 7 25 Law Male 3.61 3 Good Low
## 8 18 Medical Female 3.85 4 Good Low
## 9 19 Medical Male 3.26 5 Good Low
## 10 22 Computer Sci… Male 3.46 3 Good Moderate
## # ℹ 4,488 more rows
## # ℹ 12 more variables: Diet_Quality <chr>, Social_Support <chr>,
## # Relationship_Status <chr>, Substance_Use <chr>,
## # Counseling_Service_Use <chr>, Family_History <chr>, Chronic_Illness <chr>,
## # Financial_Stress <dbl>, Extracurricular_Involvement <chr>,
## # Semester_Credit_Load <dbl>, Residence_Type <chr>, Total_Score <dbl>
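As an aside, dplyr 1.0+ can create Total_Score and drop the columns it was computed from in a single step via mutate()’s .keep argument; a sketch assuming that dplyr version:

```r
# .keep = "unused" drops Depression_Score and Anxiety_Score automatically,
# since both were used to compute Total_Score
After_Deleting_Columns <- filtered_data %>%
  mutate(Total_Score = Depression_Score + Anxiety_Score, .keep = "unused")
```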
## After filtering and mutating, visualizing the relation between stress level and other factors
# Explore factors influencing stress levels
# Scatter plot for numerical variables
facet_grid_plot <- ggplot(data = After_Deleting_Columns) +
geom_point(mapping = aes(x = CGPA, y = Stress_Level, color = Course)) +
facet_grid(rows = vars(Gender), cols = vars(Course), labeller = label_both)
facet_grid_gender <- ggplot(data = Student_Data) +
geom_point(mapping = aes(x = CGPA, y = Stress_Level, color = Gender)) +
facet_grid(cols = vars(Gender))
# Boxplot for categorical variables
boxplot_gender <- ggplot(After_Deleting_Columns, aes(x = Gender, y = Stress_Level)) +
geom_boxplot() +
labs(title = "Boxplot of Stress Level by Gender", x = "Gender", y = "Stress Level")
boxplot_course <- ggplot(After_Deleting_Columns, aes(x = Course, y = Stress_Level)) +
geom_boxplot() +
labs(title = "Boxplot of Stress Level by Course", x = "Course", y = "Stress Level") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Display plots
print(facet_grid_plot)
print(facet_grid_gender)
## Warning: Removed 12 rows containing missing values (`geom_point()`).
print(boxplot_gender)
print(boxplot_course)
The plots still show no clear differences in the behaviour of these variables.
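Because the plots alone are hard to read, a formal test can check whether mean stress actually differs across these groups. A minimal sketch, assuming `After_Deleting_Columns` from the earlier cleaning step is still available:

```r
# One-way ANOVA: does mean Stress_Level differ by Gender? By Course?
summary(aov(Stress_Level ~ Gender, data = After_Deleting_Columns))

anova_course <- aov(Stress_Level ~ Course, data = After_Deleting_Columns)
summary(anova_course)

# If the Course effect is significant, Tukey's HSD shows which pairs differ
TukeyHSD(anova_course)
```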
#Finding the Average CGPA
# Summarize the data to calculate average CGPA
avg_cgpa <- After_Deleting_Columns %>% summarize(Avg_CGPA = mean(CGPA, na.rm = TRUE))
avg_cgpa
## # A tibble: 1 × 1
## Avg_CGPA
## <dbl>
## 1 3.49
The average CGPA in the dataset is approximately 3.49. This suggests a moderate overall academic performance among the students. CGPA is a key metric reflecting students' academic success.
avg_stress_level <-After_Deleting_Columns%>% summarize(Avg_Stress = mean(Stress_Level, na.rm = TRUE))
avg_stress_level
## # A tibble: 1 × 1
## Avg_Stress
## <dbl>
## 1 2.46
The average stress level in the dataset is around 2.46, indicating a moderate level of stress among the students. Stress level is an essential metric reflecting students' mental well-being.
grouped_data_by_gender_cgpa <- mutated_data %>% group_by(Gender) %>% summarize(Avg_CGPA = mean(CGPA, na.rm = TRUE))
grouped_data_by_gender_cgpa
## # A tibble: 2 × 2
## Gender Avg_CGPA
## <chr> <dbl>
## 1 Female 3.49
## 2 Male 3.48
The data suggests only a marginal difference in CGPA between female and male students.
grouped_data_by_gender<- mutated_data %>% group_by(Gender) %>% summarize(Avg_Stress = mean(Stress_Level, na.rm = TRUE))
grouped_data_by_gender
## # A tibble: 2 × 2
## Gender Avg_Stress
## <chr> <dbl>
## 1 Female 2.44
## 2 Male 2.47
The data suggests a small difference in stress levels between female and male students.
grouped_data_by_Sleep_Quality <- mutated_data %>% group_by(Sleep_Quality) %>% summarize(Avg_Stress = mean(Stress_Level, na.rm = TRUE))
grouped_data_by_Sleep_Quality
## # A tibble: 3 × 2
## Sleep_Quality Avg_Stress
## <chr> <dbl>
## 1 Average 2.47
## 2 Good 2.45
## 3 Poor 2.44
Students with “Average” sleep quality have an average stress level of approximately 2.47, those with “Good” sleep quality a slightly lower 2.45, and, counterintuitively, students with “Poor” sleep quality the lowest at roughly 2.44.
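A counterintuitive ordering like this is worth checking against the group sizes and spread behind each mean; a sketch, assuming `mutated_data` from the steps above:

```r
library(dplyr)
# Group size, mean, and standard deviation, not just the mean
mutated_data %>%
  group_by(Sleep_Quality) %>%
  summarize(n          = n(),
            Avg_Stress = mean(Stress_Level, na.rm = TRUE),
            SD_Stress  = sd(Stress_Level, na.rm = TRUE))
```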
grouped_data_by_Course <- mutated_data %>% group_by(Course) %>% summarize(Avg_Stress = mean(Stress_Level, na.rm = TRUE))
grouped_data_by_Course
## # A tibble: 6 × 2
## Course Avg_Stress
## <chr> <dbl>
## 1 Business 2.14
## 2 Computer Science 2.08
## 3 Engineering 2.15
## 4 Law 2.19
## 5 Medical 3.23
## 6 Others 2.08
From the table it is observed that average stress is highest for Medical students (3.23, versus roughly 2.1-2.2 for every other course).
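The gap between Medical students and the rest can be tested directly; a minimal sketch, again assuming `mutated_data` is available:

```r
# Two-sample t-test: Medical students vs. all other courses
medical <- mutated_data$Stress_Level[mutated_data$Course == "Medical"]
others  <- mutated_data$Stress_Level[mutated_data$Course != "Medical"]
t.test(medical, others)
```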
grouped_data_by_Chronic_Illness <- mutated_data %>% group_by(Chronic_Illness) %>% summarize(Avg_Stress = mean(Stress_Level, na.rm = TRUE))
grouped_data_by_Chronic_Illness
## # A tibble: 2 × 2
## Chronic_Illness Avg_Stress
## <chr> <dbl>
## 1 No 2.45
## 2 Yes 2.56
Students with a chronic illness have an average stress level about 0.1 higher than students without one.
grouped_data_by_Residence_Type <- mutated_data %>% group_by(Residence_Type) %>% summarize(Avg_Stress = mean(Stress_Level, na.rm = TRUE))
grouped_data_by_Residence_Type
## # A tibble: 3 × 2
## Residence_Type Avg_Stress
## <chr> <dbl>
## 1 Off-Campus 2.47
## 2 On-Campus 2.42
## 3 With Family 2.49
From the plots it appeared that students living with family have the highest stress levels, but the averages show that off-campus (2.47) and with-family (2.49) students have similar stress levels, with on-campus students slightly lower (2.42).
grouped_data_by_Extracurricular_Involvement <- mutated_data %>% group_by(Extracurricular_Involvement) %>% summarize(Avg_Stress = mean(Stress_Level, na.rm = TRUE))
grouped_data_by_Extracurricular_Involvement
## # A tibble: 3 × 2
## Extracurricular_Involvement Avg_Stress
## <chr> <dbl>
## 1 High 2.45
## 2 Low 2.46
## 3 Moderate 2.46
I was surprised by the average stress values for Extracurricular_Involvement: the plots suggested that students with high involvement have higher stress, but the group averages are nearly identical (2.45-2.46). Averaging the stress level likely masks the differences visible in the plots.
grouped_data_by_Substance_Use <- mutated_data %>% group_by(Substance_Use) %>% summarize(Avg_Stress = mean(Stress_Level, na.rm = TRUE))
grouped_data_by_Substance_Use
## # A tibble: 3 × 2
## Substance_Use Avg_Stress
## <chr> <dbl>
## 1 Frequently 2.44
## 2 Never 2.46
## 3 Occasionally 2.46
The difference in average stress level across the substance-use groups is negligible, about 0.01-0.02.
grouped_data_by_Counseling_Service_Use <- mutated_data %>% group_by(Counseling_Service_Use) %>% summarize(Avg_Stress = mean(Stress_Level, na.rm = TRUE))
grouped_data_by_Counseling_Service_Use
## # A tibble: 3 × 2
## Counseling_Service_Use Avg_Stress
## <chr> <dbl>
## 1 Frequently 2.54
## 2 Never 2.43
## 3 Occasionally 2.46
Students who use the counseling service frequently have an average stress level about 0.1 higher than those who never use it.
grouped_data_by_Family_History <- mutated_data %>% group_by(Family_History) %>% summarize(Avg_Stress = mean(Stress_Level, na.rm = TRUE))
grouped_data_by_Family_History
## # A tibble: 2 × 2
## Family_History Avg_Stress
## <chr> <dbl>
## 1 No 2.44
## 2 Yes 2.48
Students with a family history of mental health issues have an average stress level about 0.04 higher than those without.
mutated_data
## # A tibble: 4,498 × 21
## Age Course Gender CGPA Stress_Level Depression_Score Anxiety_Score
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 19 Business Female 3.74 4 0 3
## 2 18 Business Male 3.4 3 3 4
## 3 21 Business Female 3.4 0 3 3
## 4 24 Medical Male 3.8 3 2 1
## 5 22 Computer Scie… Male 3.19 1 1 3
## 6 27 Medical Male 3.26 3 2 2
## 7 25 Law Male 3.61 3 1 5
## 8 18 Medical Female 3.85 4 1 3
## 9 19 Medical Male 3.26 5 1 1
## 10 22 Computer Scie… Male 3.46 3 1 0
## # ℹ 4,488 more rows
## # ℹ 14 more variables: Sleep_Quality <chr>, Physical_Activity <chr>,
## # Diet_Quality <chr>, Social_Support <chr>, Relationship_Status <chr>,
## # Substance_Use <chr>, Counseling_Service_Use <chr>, Family_History <chr>,
## # Chronic_Illness <chr>, Financial_Stress <dbl>,
## # Extracurricular_Involvement <chr>, Semester_Credit_Load <dbl>,
## # Residence_Type <chr>, Total_Score <dbl>
##Modeling the Data
# Create a scatter plot
scatter_plot <- ggplot(mutated_data, aes(x = CGPA, y = Stress_Level)) +
geom_point() +
labs(title = "Scatter Plot of Stress Level and CGPA", x = "CGPA", y = "Stress_Level") +
theme(plot.title = element_text(hjust = 0.5))
# Add the linear regression line
linear_model_plot <- scatter_plot +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Linear Model: Stress Level ~ CGPA")
# Display the plot
print(linear_model_plot)
## `geom_smooth()` using formula = 'y ~ x'
# Create a scatter plot
scatter_plot <- ggplot(mutated_data, aes(x = Total_Score, y = Stress_Level)) +
geom_point() +
labs(title = "Scatter Plot of Stress Level and Total_Score", x = "Total_Score", y = "Stress_Level") +
theme(plot.title = element_text(hjust = 0.5))
# Add the linear regression line
linear_model_plot <- scatter_plot +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Linear Model: Stress Level ~ Total_Score")
# Display the plot
print(linear_model_plot)
## `geom_smooth()` using formula = 'y ~ x'
# Assuming linear_model is the linear model object
linear_model <- lm(Stress_Level ~ Total_Score, data = mutated_data)
# Print the summary of the linear model
summary(linear_model)
##
## Call:
## lm(formula = Stress_Level ~ Total_Score, data = mutated_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.639 -1.439 -0.240 1.481 2.760
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.63873 0.05414 48.741 < 2e-16 ***
## Total_Score -0.03987 0.01064 -3.747 0.000181 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.633 on 4496 degrees of freedom
## Multiple R-squared: 0.003113, Adjusted R-squared: 0.002892
## F-statistic: 14.04 on 1 and 4496 DF, p-value: 0.0001811
From the above outputs, a linear model does not suit my dataset: no feature shows a clear linear relationship with stress level, and the R-squared value is significantly low (0.003).
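One quick way to confirm the absence of linear relationships is to correlate every numeric column with Stress_Level at once; a sketch, assuming `mutated_data` from the earlier steps:

```r
library(dplyr)
# Absolute correlation of each numeric variable with Stress_Level,
# sorted from strongest to weakest
numeric_vars <- mutated_data %>% select(where(is.numeric))
cors <- cor(numeric_vars, use = "complete.obs")["Stress_Level", ]
sort(abs(cors[names(cors) != "Stress_Level"]), decreasing = TRUE)
```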
##Modeling using Machine Learning
First, using Random Forest to find out which features are important in my data.
model_rf <- randomForest(Stress_Level ~ ., data = mutated_data)
importance(model_rf)
## IncNodePurity
## Age 974.3391
## Course 1332.3462
## Gender 216.5634
## CGPA 1441.9538
## Depression_Score 526.2089
## Anxiety_Score 535.2441
## Sleep_Quality 346.2841
## Physical_Activity 355.1167
## Diet_Quality 338.6833
## Social_Support 355.4438
## Relationship_Status 337.2928
## Substance_Use 285.6439
## Counseling_Service_Use 353.6239
## Family_History 210.6777
## Chronic_Illness 124.8738
## Financial_Stress 669.6823
## Extracurricular_Involvement 371.4175
## Semester_Credit_Load 1006.6853
## Residence_Type 361.0957
## Total_Score 670.1209
The variable importance scores, measured by IncNodePurity, show which features contribute most to stress levels in this dataset. CGPA emerges as the most influential, followed by Course and Semester_Credit_Load, suggesting a strong connection between academic performance and workload and stress. Age, Total_Score, and Financial_Stress are also impactful. Extracurricular involvement and residence type exhibit moderate importance, emphasizing the multifaceted nature of stress determinants. These insights can guide further exploration and interventions to mitigate stress among students.
Plotting the variable importance output of the model
# Extract variable importance; importance() on a regression forest
# returns a one-column matrix whose only column is "IncNodePurity".
# (Calling importance(model_rf, type = 1) fails with "subscript out of
# bounds" because type = 1, mean decrease in accuracy, is only stored
# when the forest is grown with importance = TRUE.)
importance_matrix <- importance(model_rf)
var_importance <- data.frame(
Variable = rownames(importance_matrix),
Importance = importance_matrix[, "IncNodePurity"]
)
# Create a bar plot
ggplot(var_importance, aes(x = reorder(Variable, -Importance), y = Importance)) +
geom_bar(stat = "identity", fill = "skyblue", color = "black") +
labs(title = "Variable Importance in Random Forest Model",
x = "Variable",
y = "Importance") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Originally I got “Error in importance(model_rf, type = 1)[, "IncNodePurity"] : subscript out of bounds”; dropping type = 1 and indexing the IncNodePurity column directly resolves it.
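Alternatively, the randomForest package ships a built-in importance plot, which sidesteps the manual extraction entirely; a minimal sketch, assuming `model_rf` from the chunk above:

```r
library(randomForest)
# Dotchart of IncNodePurity for each predictor, sorted by importance
varImpPlot(model_rf, main = "Variable Importance (IncNodePurity)")
```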
##Modeling My data for prediction purposes
# Step 1: Data Splitting
set.seed(123)
splitIndex <- createDataPartition(mutated_data$Stress_Level, p = 0.7, list = FALSE)
train_data <- mutated_data[splitIndex, ]
test_data <- mutated_data[-splitIndex, ]
# Step 2: Data Preprocessing (apply as needed)
# Step 3: Model Selection (multinomial logistic regression)
multinom_model <- multinom(Stress_Level ~ ., data = train_data)
## # weights: 210 (170 variable)
## initial value 5645.834088
## iter 10 value 5510.678407
## iter 20 value 5435.817565
## iter 30 value 5407.313965
## iter 40 value 5385.346316
## iter 50 value 5380.602475
## iter 60 value 5378.515633
## iter 70 value 5378.060746
## iter 80 value 5377.841727
## iter 90 value 5377.752102
## iter 100 value 5377.728107
## final value 5377.728107
## stopped after 100 iterations
# Step 4: Model Training (already done within the multinom function)
# Step 5: Model Evaluation
multinom_predictions <- predict(multinom_model, newdata = test_data)
# Check and set levels for multinom_predictions
unique_levels_multinom <- unique(multinom_predictions)
unique_levels_test <- unique(test_data$Stress_Level)
# Identify mismatched levels
missing_levels <- setdiff(unique_levels_test, unique_levels_multinom)
extra_levels <- setdiff(unique_levels_multinom, unique_levels_test)
# Remove extra levels from multinom_predictions
multinom_predictions <- factor(multinom_predictions, levels = unique_levels_multinom)
# Add missing levels to multinom_predictions
if (length(missing_levels) > 0) {
multinom_predictions <- factor(multinom_predictions, levels = c(unique_levels_multinom, missing_levels))
}
# Remove extra levels from test_data$Stress_Level
test_data$Stress_Level <- factor(test_data$Stress_Level, levels = unique_levels_test)
# Add missing levels to test_data$Stress_Level
if (length(extra_levels) > 0) {
test_data$Stress_Level <- factor(test_data$Stress_Level, levels = c(unique_levels_test, extra_levels))
}
# Now run confusionMatrix
confusionMatrix(multinom_predictions, test_data$Stress_Level)
## Warning in confusionMatrix.default(multinom_predictions,
## test_data$Stress_Level): Levels are not in the same order for reference and
## data. Refactoring data to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 3 1 2 4 5
## 0 45 33 31 44 21 20
## 3 39 65 49 41 43 39
## 1 52 57 63 56 29 20
## 2 62 54 56 47 20 37
## 4 13 42 17 25 56 51
## 5 5 25 8 23 31 28
##
## Overall Statistics
##
## Accuracy : 0.2257
## 95% CI : (0.2036, 0.249)
## No Information Rate : 0.2049
## P-Value [Acc > NIR] : 0.0328110
##
## Kappa : 0.0665
##
## Mcnemar's Test P-Value : 0.0002106
##
## Statistics by Class:
##
## Class: 0 Class: 3 Class: 1 Class: 2 Class: 4 Class: 5
## Sensitivity 0.20833 0.23551 0.28125 0.19915 0.28000 0.14359
## Specificity 0.86826 0.80299 0.80944 0.79388 0.87097 0.92014
## Pos Pred Value 0.23196 0.23551 0.22744 0.17029 0.27451 0.23333
## Neg Pred Value 0.85169 0.80299 0.84953 0.82353 0.87402 0.86390
## Prevalence 0.16036 0.20490 0.16630 0.17520 0.14848 0.14477
## Detection Rate 0.03341 0.04826 0.04677 0.03489 0.04157 0.02079
## Detection Prevalence 0.14402 0.20490 0.20564 0.20490 0.15145 0.08909
## Balanced Accuracy 0.53830 0.51925 0.54534 0.49652 0.57548 0.53186
Visualizing the Confusion matrix
# Confusion matrix (values copied from the confusionMatrix output above;
# rows are predictions, columns are the reference classes)
conf_matrix <- matrix(c(45, 33, 31, 44, 21, 20,
39, 65, 49, 41, 43, 39,
52, 57, 63, 56, 29, 20,
62, 54, 56, 47, 20, 37,
13, 42, 17, 25, 56, 51,
5, 25, 8, 23, 31, 28), nrow = 6, byrow = TRUE)
colnames(conf_matrix) <- c("0", "3", "1", "2", "4", "5")
rownames(conf_matrix) <- c("0", "3", "1", "2", "4", "5")
# Convert to data frame for ggplot
conf_df <- as.data.frame(as.table(conf_matrix))
conf_df$Var1 <- factor(conf_df$Var1, levels = c("0", "3", "1", "2", "4", "5"))
conf_df$Var2 <- factor(conf_df$Var2, levels = c("0", "3", "1", "2", "4", "5"))
# Plot confusion matrix (Var1 holds the prediction rows, Var2 the reference columns)
ggplot(conf_df, aes(x = Var2, y = Var1, fill = Freq)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "white", high = "blue") +
theme_minimal() +
labs(title = "Confusion Matrix", x = "Reference", y = "Prediction")
# Plot overall statistics (values taken from the confusionMatrix output above;
# sensitivity and specificity are macro-averages across the six classes)
overall_stats <- data.frame(
Metric = c("Accuracy", "Kappa", "Sensitivity", "Specificity"),
Value = c(0.2257, 0.0665, 0.2246, 0.8443)
)
ggplot(overall_stats, aes(x = Metric, y = Value)) +
geom_bar(stat = "identity", fill = "skyblue") +
theme_minimal() +
labs(title = "Overall Statistics", y = "Value")
The confusion matrix and statistics show that the model has low accuracy (0.2257), barely above the no-information rate of 0.2049. The Kappa value (0.0665) is close to zero, indicating poor agreement between predictions and reference. McNemar's test has a very low p-value, suggesting a significant difference between predicted and actual values. Class-specific metrics show varied performance across the stress levels.
So I am still working on other models to raise the accuracy; I have tried many approaches but could not improve it. In future I will explore other datasets.
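One direction worth trying: since Stress_Level is an ordered scale (0 to 5), ordinal logistic regression may suit it better than an unordered multinomial model. A minimal sketch on the same split, assuming `train_data` and `test_data` from Step 1 above; the predictor set here is an illustrative choice, not a tuned one:

```r
library(MASS)  # note: MASS::select masks dplyr::select after loading

# Treat the response as an ordered factor
train_data$Stress_Level <- factor(train_data$Stress_Level, ordered = TRUE)

# Proportional-odds (ordinal) logistic regression
polr_model <- polr(Stress_Level ~ CGPA + Total_Score + Financial_Stress +
                     Semester_Credit_Load, data = train_data, Hess = TRUE)

# Predicted class for each test row, and a simple accuracy figure
polr_predictions <- predict(polr_model, newdata = test_data)
mean(as.character(polr_predictions) == as.character(test_data$Stress_Level))
```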
I then tried another dataset from Kaggle, “Student Stress Factors: A Comprehensive Analysis (Understanding the Underlying Causes and Their Impact on Today's Students).” Before folding it into my own analysis, I tried to answer the questions published by the author of the dataset; I felt that being able to answer those questions would help my analysis.
##Reading Data(Importing Data)
# Assuming your file is named "StressLevelDataset.csv"
file_path <- "StressLevelDataset.csv"
# Read the CSV file without showing column types
Student_Data_2 <- read_csv(file_path, show_col_types = FALSE)
# Print information about the data
print(Student_Data_2)
## # A tibble: 1,100 × 21
## anxiety_level self_esteem mental_health_history depression headache
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 14 20 0 11 2
## 2 15 8 1 15 5
## 3 12 18 1 14 2
## 4 16 12 1 15 4
## 5 16 28 0 7 2
## 6 20 13 1 21 3
## 7 4 26 0 6 1
## 8 17 3 1 22 4
## 9 13 22 1 12 3
## 10 6 8 0 27 4
## # ℹ 1,090 more rows
## # ℹ 16 more variables: blood_pressure <dbl>, sleep_quality <dbl>,
## # breathing_problem <dbl>, noise_level <dbl>, living_conditions <dbl>,
## # safety <dbl>, basic_needs <dbl>, academic_performance <dbl>,
## # study_load <dbl>, teacher_student_relationship <dbl>,
## # future_career_concerns <dbl>, social_support <dbl>, peer_pressure <dbl>,
## # extracurricular_activities <dbl>, bullying <dbl>, stress_level <dbl>
# Print the number of rows and columns
cat("Number of rows:", nrow(Student_Data_2), "\n")
## Number of rows: 1100
cat("Number of columns:", ncol(Student_Data_2), "\n")
## Number of columns: 21
Overview of dataset:
Anxiety: range 0 to 21, measured with the GAD-7.
Self-esteem: range 0 to 30, measured with the Rosenberg Self-Esteem Scale.
Mental health history: 0 if no history of mental health issues, 1 if there is a history.
Depression: range 0 to 27, measured with the Patient Health Questionnaire (PHQ-9).
Other features mostly range from 0 to 5, with 0-1 considered low, 2-3 mid, and 4-5 high.
Blood pressure: range 1 to 3 (there was no direct explanation for this one; I assume 1 is low, 2 normal, 3 high).
Stress level: range 0 to 2 (again no explanation; I assume 0 is low/absent, 1 medium, 2 high).
Generally a lower score is better, but for some features it is hard to decide; for sleep quality, for example, we can make an educated guess after some data analysis.
(Taken from the Kaggle discussion about the dataset; as I was short of time, I tried to learn from the discussion: https://www.kaggle.com/datasets/rxnach/student-stress-factors-a-comprehensive-analysis)
This dataset contains around 20 features that create the most impact on the stress of a student. The features were selected considering 5 major factors: Psychological, Physiological, Social, Environmental, and Academic. Some of them are:
Psychological factors: ‘anxiety_level’, ‘self_esteem’, ‘mental_health_history’, ‘depression’.
Physiological factors: ‘headache’, ‘blood_pressure’, ‘sleep_quality’, ‘breathing_problem’.
Environmental factors: ‘noise_level’, ‘living_conditions’, ‘safety’, ‘basic_needs’.
Academic factors: ‘academic_performance’, ‘study_load’, ‘teacher_student_relationship’, ‘future_career_concerns’.
Social factors: ‘social_support’, ‘peer_pressure’, ‘extracurricular_activities’, ‘bullying’.
Trying to answer the following questions:
##Descriptive Statistics:
number_of_students <- nrow(Student_Data_2)
# Print the result
print(number_of_students)
## [1] 1100
There are a total of 1100 students in this dataset.
# Replace 'anxiety_level' with the actual column name for anxiety level
average_anxiety <- mean(Student_Data_2$anxiety_level, na.rm = TRUE)
# Print the result
print(average_anxiety)
## [1] 11.06364
The average anxiety level is 11.06.
# Replace 'mental_health_history' with the actual column name for mental health history
students_with_mental_health_history <- sum(Student_Data_2$mental_health_history == 1, na.rm = TRUE)
# Print the result
print(students_with_mental_health_history)
## [1] 542
542 students have reported a history of mental health issues
##Psychological Factors:
# Replace 'self_esteem' with the actual column name for self-esteem
self_esteem_average <- mean(Student_Data_2$self_esteem, na.rm = TRUE)
# Count the number of students with self-esteem below the average
students_below_average_self_esteem <- sum(Student_Data_2$self_esteem < self_esteem_average, na.rm = TRUE)
# Print the result
print(students_below_average_self_esteem)
## [1] 507
507 students have a self-esteem level below the average.
# Replace 'depression' with the actual column name for depression
students_with_depression <- sum(Student_Data_2$depression == 1, na.rm = TRUE)
# Total number of students
total_students <- nrow(Student_Data_2)
# Calculate the percentage
percentage_with_depression <- (students_with_depression / total_students) * 100
# Print the result
print(percentage_with_depression)
## [1] 3.272727
About 3.27% of students have a depression score of exactly 1. Note that this counts only scores equal to 1; since PHQ-9 scores range from 0 to 27, it is not the share of all students experiencing depression.
##Physiological Factors:
students_with_frequent_headaches <- sum(Student_Data_2$headache == 5, na.rm = TRUE)
# Print the result
print(students_with_frequent_headaches)
## [1] 129
Here a value of 5 is taken to mean frequent headaches; 129 students experience headaches frequently.
average_blood_pressure <- mean(Student_Data_2$blood_pressure, na.rm = TRUE)
# Print the result
print(average_blood_pressure)
## [1] 2.181818
The average blood pressure reading among the students is 2.18.
# Replace 1 with the value that indicates poor sleep quality in your dataset
students_with_poor_sleep_quality <- sum(Student_Data_2$sleep_quality == 1, na.rm = TRUE)
# Print the result
print(students_with_poor_sleep_quality)
## [1] 328
Here I take 1 as the lowest sleep-quality rating; 328 students rate their sleep quality as poor.
##Environmental Factors:
students_with_high_noise_levels <- sum(Student_Data_2$noise_level == 5, na.rm = TRUE)
# Print the result
print(students_with_high_noise_levels)
## [1] 137
137 students live in conditions with a high noise level.
# Count the occurrences of students feeling unsafe
unsafe_students_count <- sum(Student_Data_2$safety == 5, na.rm = TRUE)
# Total number of students
total_students <- nrow(Student_Data_2)
# Calculate the percentage
percentage_feeling_unsafe <- (unsafe_students_count / total_students) * 100
print(percentage_feeling_unsafe)
## [1] 15.54545
15.54% of students feel unsafe in their living conditions
unmet_basic_needs_count <- sum(Student_Data_2$basic_needs == 5, na.rm = TRUE)
# Print the result
print(unmet_basic_needs_count)
## [1] 189
189 students have reported not having their basic needs met
##Academic Factors:
below_average_performance_count <- sum(Student_Data_2$academic_performance < 2, na.rm = TRUE)
# Print the result
print(below_average_performance_count)
## [1] 213
Since academic performance ranges from 1 to 5, I take 2 as average; 213 students rate their academic performance as below average.
average_study_load <- mean(Student_Data_2$study_load, na.rm = TRUE)
# Print the result
print(average_study_load)
## [1] 2.621818
Average study load reported by students is 2.62
career_concerns_count <- sum(Student_Data_2$future_career_concerns == 3, na.rm = TRUE)
# Print the result
print(career_concerns_count)
## [1] 173
There are 173 students with a moderate level (3) of concern about their future career.
##Social Factors:
strong_social_support_count <- sum(Student_Data_2$social_support >= 3, na.rm = TRUE)
# Print the result
print(strong_social_support_count)
## [1] 458
I took a value of 3 or higher to indicate strong social support; by that measure, 458 students feel they have strong social support.
# Count the occurrences of students experiencing bullying
bullying_count <- sum(Student_Data_2$bullying == 5, na.rm = TRUE)
# Total number of students
total_students <- nrow(Student_Data_2)
# Calculate the percentage
percentage_experiencing_bullying <- (bullying_count / total_students) * 100
# Print the result
print(percentage_experiencing_bullying)
## [1] 15.72727
Assuming a value of 5 indicates experiencing bullying, 15.72% of students have experienced bullying.
extracurricular_participation_count <- sum(Student_Data_2$extracurricular_activities == 1, na.rm = TRUE)
# Print the result
print(extracurricular_participation_count)
## [1] 184
Assuming a value of 1 indicates participation in extracurricular activities, 184 students participate in extracurricular activities.
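The individual counts above can also be computed in a single dplyr pass, which makes the assumed cutoffs easy to compare side by side; a sketch using the same thresholds as the chunks above:

```r
library(dplyr)
# One summarize() call covering the per-question counts above
Student_Data_2 %>%
  summarize(
    frequent_headaches = sum(headache == 5, na.rm = TRUE),
    poor_sleep         = sum(sleep_quality == 1, na.rm = TRUE),
    high_noise         = sum(noise_level == 5, na.rm = TRUE),
    feeling_unsafe     = sum(safety == 5, na.rm = TRUE),
    unmet_basic_needs  = sum(basic_needs == 5, na.rm = TRUE),
    bullied            = sum(bullying == 5, na.rm = TRUE)
  )
```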
##Comparative Analysis:
# Calculate the correlation coefficient
correlation_coefficient <- cor(Student_Data_2$anxiety_level, Student_Data_2$academic_performance, use = "complete.obs")
# Print the result
print(correlation_coefficient)
## [1] -0.6496011
The correlation coefficient of -0.6496011 suggests a moderately strong negative correlation between anxiety levels and academic performance in the dataset.
correlation_coefficient <- cor(Student_Data_2$sleep_quality, Student_Data_2$depression, use = "complete.obs")
# Print the result
print(correlation_coefficient)
## [1] -0.6931609
ggplot(Student_Data_2, aes(x = sleep_quality, y = depression)) +
geom_point(color = "blue") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Scatter Plot with Regression Line: Sleep Quality vs. Depression Levels", x = "Sleep Quality", y = "Depression Levels")
## `geom_smooth()` using formula = 'y ~ x'
The correlation coefficient of -0.69 indicates a moderately strong
negative relationship between sleep quality and depression levels. As
sleep quality decreases, depression levels tend to increase. This
suggests a notable association between these variables in the
dataset.
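Rather than checking pairs one at a time, every feature's correlation with stress_level can be computed in one step, since all columns in this dataset are numeric; a minimal sketch:

```r
# Correlation of each feature with stress_level, strongest first
cors <- cor(Student_Data_2, use = "complete.obs")[, "stress_level"]
sort(cors[names(cors) != "stress_level"], decreasing = TRUE)
```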
contingency_table <- table(Student_Data_2$bullying, Student_Data_2$mental_health_history)
# Convert to long format for ggplot (one row per bullying level x history combination)
contingency_df <- as.data.frame(contingency_table)
colnames(contingency_df) <- c("Bullying", "Mental_Health_History", "Count")
levels(contingency_df$Mental_Health_History) <- c("No Mental Health History", "History of Mental Health Issues")
# Create a stacked bar chart
ggplot(contingency_df, aes(x = Bullying, y = Count, fill = Mental_Health_History)) +
geom_bar(stat = "identity") +
labs(title = "Bullying vs. Mental Health History",
x = "Bullying",
y = "Count",
fill = "Mental Health History") +
theme_minimal()
Performing Chi-square Test
# Perform a chi-squared test for independence
chi_squared_test <- chisq.test(contingency_table)
# Print the result
print(chi_squared_test)
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 532, df = 5, p-value < 2.2e-16
According to the chi-squared test, bullying and mental health history are strongly associated, suggesting that students who experience bullying are more likely to have a history of mental health issues; however, I do not see such a pattern clearly in the graph.
##General Exploration:
# Create a summary variable for each factor
psychological_sum <- rowSums(Student_Data_2[, c("self_esteem", "mental_health_history", "depression")])
physiological_sum <- rowSums(Student_Data_2[, c("headache", "blood_pressure", "sleep_quality", "breathing_problem")])
environmental_sum <- rowSums(Student_Data_2[, c("noise_level", "living_conditions", "safety", "basic_needs")])
academic_sum <- rowSums(Student_Data_2[, c("academic_performance", "study_load", "teacher_student_relationship", "future_career_concerns")])
social_sum <- rowSums(Student_Data_2[, c("social_support", "peer_pressure", "extracurricular_activities", "bullying")])
# Create a data frame to store the sums
sums_data <- data.frame(
Factor = c("Psychological", "Physiological", "Environmental", "Academic", "Social"),
Total_Count = c(sum(psychological_sum), sum(physiological_sum), sum(environmental_sum), sum(academic_sum), sum(social_sum))
)
ggplot(sums_data, aes(x = Factor, y = Total_Count, fill = Factor)) +
geom_bar(stat = "identity") +
labs(title = "Total Counts of Negative Experiences or Conditions by Factor",
x = "Factor",
y = "Total Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Calculate relative proportions
sums_data$Relative_Proportion <- sums_data$Total_Count / sum(sums_data$Total_Count)
# Print the updated sums_data
print(sums_data)
## Factor Total_Count Relative_Proportion
## 1 Psychological 33908 0.4263602
## 2 Physiological 11114 0.1397478
## 3 Environmental 11745 0.1476820
## 4 Academic 11761 0.1478832
## 5 Social 11001 0.1383269
# Create a pie chart
library(ggplot2)
ggplot(sums_data, aes(x = "", y = Relative_Proportion, fill = Factor)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y") +
labs(title = "Relative Proportions of Negative Experiences or Conditions by Factor")
Psychological has the highest number of students reporting negative experiences.
# Perform a chi-squared test for independence
contingency_table <- matrix(sums_data$Total_Count, nrow = 1, byrow = TRUE)
chi_squared_test <- chisq.test(contingency_table)
print(chi_squared_test)
##
## Chi-squared test for given probabilities
##
## data: contingency_table
## X-squared = 25499, df = 4, p-value < 2.2e-16
# Select relevant columns
selected_cols <- c("self_esteem", "mental_health_history", "depression", "headache", "blood_pressure",
"sleep_quality", "breathing_problem", "noise_level", "living_conditions", "safety",
"basic_needs", "academic_performance", "study_load", "teacher_student_relationship",
"future_career_concerns", "social_support", "peer_pressure", "extracurricular_activities",
"bullying", "stress_level")
# Subset the dataset
subset_data <- Student_Data_2 %>%
select(all_of(selected_cols))
# Melt the data for analysis
melted_data <- subset_data %>%
gather(key = "Feature", value = "Value", -stress_level)
# Fit linear models for each feature
model_results <- lm(stress_level ~ Value * Feature, data = melted_data)
# Summarize model results
summary_results <- broom::tidy(model_results)
# Plot coefficients and confidence intervals
ggplot(summary_results, aes(x = term, y = estimate)) +
geom_point() +
geom_errorbar(aes(ymin = estimate - std.error, ymax = estimate + std.error), width = 0.2) +
labs(title = "Impact of Features on Stress Level",
x = "Feature",
y = "Coefficient") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Using Random Forest to find which factors have more influence
# Assuming your dataset is named 'your_dataset'
# Load necessary libraries
library(randomForest)
library(ggplot2)
# Select relevant columns
selected_cols <- c("self_esteem", "mental_health_history", "depression", "headache", "blood_pressure",
"sleep_quality", "breathing_problem", "noise_level", "living_conditions", "safety",
"basic_needs", "academic_performance", "study_load", "teacher_student_relationship",
"future_career_concerns", "social_support", "peer_pressure", "extracurricular_activities",
"bullying", "stress_level")
# Subset the dataset
subset_data <- Student_Data_2 %>%
select(all_of(selected_cols))
# Split the data into training and testing sets
set.seed(123) # for reproducibility
train_indices <- sample(1:nrow(subset_data), 0.8 * nrow(subset_data))
train_data <- subset_data[train_indices, ]
test_data <- subset_data[-train_indices, ]
# Fit a random forest model
rf_model <- randomForest(stress_level ~ ., data = train_data, ntree = 500)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
# Print variable importance
print(rf_model$importance)
## IncNodePurity
## self_esteem 68.454547
## mental_health_history 2.531924
## depression 44.165072
## headache 53.925285
## blood_pressure 4.706580
## sleep_quality 85.688590
## breathing_problem 7.072772
## noise_level 12.776804
## living_conditions 5.785769
## safety 28.267338
## basic_needs 40.029329
## academic_performance 32.189123
## study_load 9.066602
## teacher_student_relationship 47.989078
## future_career_concerns 29.924651
## social_support 8.959359
## peer_pressure 20.796274
## extracurricular_activities 36.588251
## bullying 43.549378
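The importance table is easier to read sorted. A sketch with base R graphics; the named vector below is a stand-in for `importance(rf_model)[, "IncNodePurity"]`, with a few values copied from the table above:

```r
# Stand-in for importance(rf_model)[, "IncNodePurity"] (top features only)
imp <- c(sleep_quality = 85.7, self_esteem = 68.5, headache = 53.9,
         teacher_student_relationship = 48.0, depression = 44.2,
         bullying = 43.5, basic_needs = 40.0)

# Sort ascending so the largest bar ends up on top of the horizontal chart
imp <- sort(imp)
par(mar = c(4, 12, 2, 1))  # widen the left margin for long labels
barplot(imp, horiz = TRUE, las = 1, xlab = "IncNodePurity",
        main = "Variable importance (top features)")
```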
This dataset gives the expected results and supports my assumptions. It shows that Self Esteem, Depression, Headache, Sleep Quality, Teacher Student Relationship, Extracurricular Activities, and Bullying are the factors that most affect stress level.
##Visualizing the factors vs stress
# Select relevant columns
selected_cols <- c("self_esteem", "mental_health_history", "depression", "headache", "blood_pressure",
"sleep_quality", "breathing_problem", "noise_level", "living_conditions", "safety",
"basic_needs", "academic_performance", "study_load", "teacher_student_relationship",
"future_career_concerns", "social_support", "peer_pressure", "extracurricular_activities",
"bullying", "stress_level")
# Subset the dataset
subset_data <- Student_Data_2 %>%
select(all_of(selected_cols))
# Function to generate plots for one column, filled with a supplied colour.
# (The previous version mapped fill to a continuous variable and never used
# the color argument, which produced "aesthetics were dropped" warnings.)
generate_plots <- function(col, color) {
list(
scatter = ggplot(subset_data, aes(x = .data[[col]], y = stress_level)) +
geom_point(color = color) +
labs(title = paste("Scatter Plot of", col, "against Stress Level"),
x = col,
y = "Stress Level") +
theme_minimal(),
boxplot = ggplot(subset_data, aes(x = "", y = .data[[col]])) +
geom_boxplot(fill = color) +
labs(title = paste("Box Plot of", col),
x = NULL,
y = col) +
theme_minimal(),
violin = ggplot(subset_data, aes(x = "", y = .data[[col]])) +
geom_violin(fill = color) +
labs(title = paste("Violin Plot of", col),
x = NULL,
y = col) +
theme_minimal()
)
}
# Create different types of plots with different colors
for (i in seq_along(selected_cols)) {
col <- selected_cols[i]
color <- rainbow(length(selected_cols))[i]
plot_list <- generate_plots(col, color)
# Save or display the plots
lapply(plot_list, print)
}
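Printing roughly 60 plots inline floods the report; `ggsave()` could write them to files instead. A sketch under the assumption that `plot_list` has the structure built above; the `plots/` directory name is made up:

```r
library(ggplot2)

# Toy plot standing in for one entry of plot_list
p <- ggplot(data.frame(x = 1:5, y = 1:5), aes(x, y)) + geom_point()

dir.create("plots", showWarnings = FALSE)
# Inside the loop this would be, e.g.:
#   ggsave(file.path("plots", paste0(col, "_scatter.png")), plot_list$scatter)
ggsave(file.path("plots", "example_scatter.png"), p, width = 5, height = 4)
```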
##Tidying the data
# Replace empty strings with NA
Student_Data_2[Student_Data_2 == ""] <- NA
# Remove rows with missing values
Student_Data_without_missing_data_2<- na.omit(Student_Data_2)
# Check the dimensions of the cleaned data
dim(Student_Data_without_missing_data_2)
## [1] 1100 21
After cleaning, no missing values remain in this dataset. Next, I fit a linear model of stress_level against each feature to visualize the direction of each relationship.
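Rather than dropping rows blindly, it helps to count missing values per column first. A sketch on a toy frame; with the real data the argument would be `Student_Data_2`:

```r
# Toy frame with one NA and one empty string (made-up values)
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", ""))

# Treat empty strings as missing, then count NAs per column
df[df == ""] <- NA
missing_per_col <- colSums(is.na(df))
print(missing_per_col)

# na.omit() drops any row containing an NA
cleaned <- na.omit(df)
nrow(cleaned)  # only the fully observed row survives here
```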
# Select relevant columns
selected_cols <- c("self_esteem", "mental_health_history", "depression", "headache", "blood_pressure",
"sleep_quality", "breathing_problem", "noise_level", "living_conditions", "safety",
"basic_needs", "academic_performance", "study_load", "teacher_student_relationship",
"future_career_concerns", "social_support", "peer_pressure", "extracurricular_activities",
"bullying", "stress_level")
# Subset the dataset
subset_data <-Student_Data_without_missing_data_2%>%
select(all_of(selected_cols))
# Create linear regression plots for each feature
for (col in selected_cols) {
if (col != "stress_level") {
lm_plot <- ggplot(subset_data, aes(x = .data[[col]], y = stress_level)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(title = paste("Linear Regression Plot of", col, "against Stress Level"),
x = col,
y = "Stress Level") +
theme_minimal()
print(lm_plot)
}
}
## `geom_smooth()` using formula = 'y ~ x'
The fitted lines show a positive relationship between stress_level and the
following features: Mental Health History, Depression, Headache, Blood
Pressure, Breathing Problem, Noise Level, Study Load, Peer Pressure,
Extracurricular Activities, Bullying, and Future Career Concerns. These
features show a negative relationship: Self Esteem, Sleep Quality, Living
Conditions, Safety, Basic Needs, Academic Performance, Teacher Student
Relationship, and Social Support.
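The signs read off the fitted lines can be confirmed numerically with correlations. A sketch on made-up data; with the real data, the toy frame would be replaced by `subset_data`:

```r
set.seed(42)
n <- 200
# Toy data: one feature built to correlate positively, one negatively
stress <- rnorm(n)
toy <- data.frame(
  bullying     =  stress + rnorm(n, sd = 0.5),  # positive relation
  self_esteem  = -stress + rnorm(n, sd = 0.5),  # negative relation
  stress_level = stress
)

# Correlation of every feature with stress_level, sorted
cors <- sapply(setdiff(names(toy), "stress_level"),
               function(col) cor(toy[[col]], toy$stress_level))
print(sort(cors))
```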
##Modeling
# Load caret for data partitioning and evaluation
library(caret)
# Step 1: Split the data into training, testing, and validation sets
set.seed(123) # For reproducibility
index <- createDataPartition(Student_Data_2$stress_level, p = 0.7, list = FALSE)
train_data <- Student_Data_2[index, ]
test_validation_data <- Student_Data_2[-index, ]
# Further split test_validation_data into testing and validation sets
index_test <- createDataPartition(test_validation_data$stress_level, p = 0.5, list = FALSE)
test_data <- test_validation_data[index_test, ]
validation_data <- test_validation_data[-index_test, ]
# Convert stress_level to a factor
train_data$stress_level <- as.factor(train_data$stress_level)
test_data$stress_level <- as.factor(test_data$stress_level)
# Step 2: Train a machine learning model (Random Forest in this example)
rf_model <- randomForest(stress_level ~ ., data = train_data)
# Step 3: Make predictions on the testing set
predictions <- predict(rf_model, newdata = test_data)
# Check factor levels in the training dataset
levels_train <- levels(train_data$stress_level)
# Check factor levels in the testing dataset
levels_test <- levels(test_data$stress_level)
# Compare levels
print(levels_train)
## [1] "0" "1" "2"
print(levels_test)
## [1] "0" "1" "2"
# Step 4: Visualize confusion matrix
conf_matrix <- confusionMatrix(predictions, test_data$stress_level)
# Extract confusion matrix as a table
conf_table <- as.table(conf_matrix)
heatmap(conf_table, col = terrain.colors(10), scale = "column",
main = "Confusion Matrix", xlab = "Predicted", ylab = "Actual")
# Calculate class accuracies
class_accuracies <- conf_matrix$byClass
# Convert class_accuracies to a data frame
class_accuracies <- as.data.frame(class_accuracies)
# Print the column names
column_names <- names(class_accuracies)
print(column_names)
## [1] "Sensitivity" "Specificity" "Pos Pred Value"
## [4] "Neg Pred Value" "Precision" "Recall"
## [7] "F1" "Prevalence" "Detection Rate"
## [10] "Detection Prevalence" "Balanced Accuracy"
# Melt the data for better visualization
melted_accuracies <- reshape2::melt(class_accuracies)
## No id variables; using all as measure variables
# Create a bar plot
ggplot(melted_accuracies, aes(x = variable, y = value, fill = variable)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Class Accuracies Bar Plot",
x = "Metric",
y = "Value") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Thankfully, this model achieves a pretty good accuracy of more than 75%. I'm very glad Dr. Swan helped me achieve this goal; following his suggestions I repeated the analysis on a second dataset, which supported my assumptions.
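The quoted overall accuracy can be recomputed directly from the confusion table, since correct predictions sit on the diagonal. A sketch with a made-up 3x3 table; with the real objects this would be `conf_matrix$table`:

```r
# Made-up confusion table: rows = predicted, cols = actual, classes 0/1/2
conf_table <- matrix(c(50,  5,  2,
                        4, 60,  6,
                        3,  5, 45),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(predicted = 0:2, actual = 0:2))

# Overall accuracy = sum of the diagonal / total count
accuracy <- sum(diag(conf_table)) / sum(conf_table)
round(accuracy, 3)  # about 0.861 for this made-up table
```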